


METHOD FOR SOLVING STOCHASTIC CONTROL PROBLEMS OF LINEAR SYSTEMS IN HIGH DIMENSION



BACKGROUND OF THE INVENTION 

1. Field of the Invention

The present invention relates generally to a method for solving stochastic control 
problems of linear systems in high dimensions. 

2. Description of Related Art 

(Note: This application references a number of different publications as indicated 
throughout the specification by reference numbers enclosed in brackets, e.g., [x]. A list 
of these different publications ordered according to these reference numbers can be found 
in the Section entitled "References" in the "Detailed Description of the Preferred 
Embodiment." Each of these publications is incorporated by reference herein.) 

Computer-implemented Supply Chain Management (SCM) applications are 
designed to link a cohesive production and distribution network and thus allow an 
enterprise to track and streamline the flow of materials and data through the process of 
manufacturing and distribution to customers. SCM applications represent a significant 
evolution from previous enterprise resource planning (ERP) systems.

One goal of SCM applications is to decrease inventory costs by matching 
production to demand. SCM applications utilize extremely complex forecasting and 
planning algorithms to predict demand based upon information stored in the enterprise's
database. These applications also incorporate any changes in supply chain data into the
forecast much faster than previous modes of calculation, allowing enterprises to more 
accurately predict demand patterns and schedule production accordingly. 

Another goal of SCM applications is to reduce overall production costs by
streamlining the flow of goods through the production process and improving
information flow between the enterprise, its suppliers, and its distributors.
Logistics-oriented systems, such as transportation, warehouse management, and factory
scheduling applications, all contribute to reduced production costs. By ensuring real-time
connectivity between the various parties in a supply chain, these applications decrease
idle time, reduce the need to store inventory, and prevent bottlenecks in the production
process.

Yet another goal of SCM applications is to improve customer satisfaction by 
offering increased speed and adaptability. SCM applications allow the enterprise to 
reduce lead times, increase quality, and offer greater customization, enhancing the 
customer relationship and improving retention.

SCM applications begin with forecasting and data mining applications analyzing 
information consolidated in the enterprise's database. Planning algorithms are used to 
generate a demand forecast upon which to base subsequent procurement orders and 
production schedules. 

Nonetheless, there is a need in the art for improved planning techniques for SCM
applications, especially where an SCM application models a Markov Decision Process
(MDP), and the action space and the state space of the MDP model are continuous and
related to each other through a system of linear constraints.

SUMMARY OF THE INVENTION 
To overcome the limitations in the prior art described above, and to overcome
other limitations that will become apparent upon reading and understanding the present
specification, the present invention discloses a method, apparatus, and article of
manufacture for solving stochastic control problems of linear systems in high dimensions
by modeling a structured Markov Decision Process (MDP). A state space for the MDP is
a polyhedron in a Euclidean space and one or more actions that are feasible in a state of
the state space are linearly constrained with respect to the state. One or more
approximations are built from above and from below to a value function for the state
using representations that facilitate the computation of approximately optimal actions at
any given state by linear programming.

Various advantages and features of novelty, which characterize the invention, are
pointed out with particularity in the claims annexed hereto and form a part hereof.
However, for a better understanding of the invention, its advantages, and the objects
obtained by its use, reference should be made to the drawings which form a further part
hereof, and to accompanying descriptive matter, in which there are illustrated and
described specific examples of an apparatus in accordance with the invention.

BRIEF DESCRIPTION OF THE DRAWINGS
Referring now to the drawings in which like reference numbers represent 
corresponding parts throughout: 

FIG. 1 is a block diagram illustrating an exemplary hardware environment used to
implement the preferred embodiment of the present invention; and

FIG. 2 is a flowchart that illustrates the general logic of a supply chain planning 
process according to the preferred embodiment of the present invention. 

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT 
In the following description of the preferred embodiment, reference is made to the
accompanying drawings, which form a part hereof, and in which is shown by way of
illustration a specific embodiment in which the invention may be practiced. It is to be
understood that other embodiments may be utilized and structural changes may be made
without departing from the scope of the present invention.

Overview 

The present invention relates to structured Markov Decision Processes (MDP)
where the state space is a polyhedron in a Euclidean space and the actions that are
feasible in a state are linearly constrained with respect to the state. The present invention
builds approximations from above and from below to the value function, using
representations that facilitate the computation of approximately optimal actions at any
given state by linear programming.

Environment

FIG. 1 is a block diagram illustrating an exemplary environment used to
implement the preferred embodiment of the present invention. One or more client
computers 100, supplier systems 102, production systems 104, and/or distribution
systems 106 communicate with a server computer 108. Each of the client computers 100,
supplier systems 102, production systems 104, distribution systems 106, and the server
computer 108 typically comprises one or more processors, memory, and other
components, such as data storage devices and data communications devices.

The client computers 100, supplier systems 102, production systems 104, and/or
distribution systems 106 typically execute one or more computer programs operating
under the control of an operating system. These computer programs transmit requests to
the server computer 108 for performing various functions and receive data from the
server computer 108 in response to the requests.

The server computer 108 also operates under the control of an operating system,
and executes one or more computer programs, such as an interface 110, supply chain
planning process 112, and database management system 114. The interface 110, supply
chain planning process 112, and database management system 114 perform various
functions related to supply chain management (or other applications), and may transmit
data to the client computers 100, supplier systems 102, production systems 104, and/or
distribution systems 106.

The server computer 108 manages one or more databases 116 stored on one or
more data storage devices. In a preferred embodiment, the databases 116 store one or
more vectors used by the supply chain planning process 112, such as resource vectors,
cost vectors, action vectors, and other vectors. These vectors may be generated, inter
alia, by an enterprise resource planning (ERP) system, a point-of-sale (POS) system, or a
manufacturing supply and distribution (MSD) system. Those skilled in the art will
recognize, however, that other embodiments may use different databases, or different
programs to access the databases.

Generally, the interface 110, supply chain planning process 112, database
management system 114, and database 116 each comprise logic and/or data that is
tangibly embodied in or retrievable from a computer-readable device, medium, carrier, or
signal, e.g., a data storage device, a remote device accessible via a data communications
device, etc. Moreover, this logic and/or data, when read, executed, and/or interpreted
by the server computer 108, causes the server computer 108 to perform the steps necessary
to implement and/or use the present invention.

Thus, the present invention may be implemented as a method, apparatus, or article
of manufacture using standard programming and/or engineering techniques to produce
software, firmware, hardware, or any combination thereof. The term "article of
manufacture" (or alternatively, "computer program product") as used herein is intended
to encompass a computer program accessible from any computer-readable device, carrier,
or media. Of course, those skilled in the art will recognize many modifications may be
made to this configuration without departing from the scope of the present invention.

Those skilled in the art will recognize that any combination of the above
components, or any number of different components, including computer programs,
peripherals, and other devices, may be used to implement the present invention, so long
as similar functions are performed thereby.

Supply Chain Planning Process 

In the preferred embodiment, the supply chain planning process 112 preferably
comprises a Markov Decision Process (MDP). MDPs were introduced by Bellman [1],
and can be abstractly described as follows. The state of a system changes, alternately, by
random transitions and by chosen actions. Before applying a chosen action, if the system
is in state $s$, then an action $x \in X(s)$ (wherein $X(s)$ is the set of possible actions at $s$)
can be taken at a cost of $c(x)$, and then the state of the system changes to $s' = g(s, x)$. A
policy is an assignment of an action $x = x(s)$ to each state $s$. If, before a random
transition, the system is in state $s$, then the subsequent state is a random variable $S$ whose
probability distribution depends only on $s$: $\Pr(S = s' \mid s) = p(s, s')$. Given an initial state
$s$, a policy induces a sequence of random variable costs $C_1, C_2, \ldots$ corresponding to the
(a priori random) actions $X_1, X_2, \ldots$ mandated by the policy, i.e., $C_i = c(X_i)$. The
total discounted cost is a random variable $C = \sum_i \lambda^i C_i$, where $0 < \lambda < 1$ is a given
constant, called a discount factor. An optimal policy is one that minimizes the expected
total discounted cost $E[C]$ for any given initial state. The value of a state $s$ is the
minimum of $E[C]$, starting at $s$ (and applying a random transition first), over all the
possible policies. The value function $L(s)$ assigns to each state $s$ the minimum possible
expected total discounted cost, where the initial state is $s$, and the system first undergoes
a random transition, followed by a transition caused by a chosen action.

The size of the state space is a major obstacle for the practicality of MDP. If the
optimal policy has to be computed explicitly in advance, then obviously, the number of
states should be quite limited. Usually, it suffices to compute the value function, but still,
if all the possible states have to be handled explicitly, then the number of states must be
limited to several millions. This number is quite small when the state space is generated
by some state variables. In particular, if the state variables are continuous, then any
reasonable discretization of them would give rise to an enormous number of states that
would prohibit solution by the standard methods of discrete MDP (see Puterman [4]).

Other approaches to the problem are surveyed in Gordon [3]. The model
discussed in Gordon [3] and the solution approach rely heavily on linear programming.
They, however, should not be confused with other applications of linear programming
methods for solving MDP, either exactly or approximately (see, for example, Trick and
Zin [5]).

The present invention considers a model where the action space and the state
space are continuous and related to each other through a system of linear constraints.
This is the case in real-life systems of supply chain management. It can be shown that the
value function is convex, and this important characteristic can be exploited for efficiently
"learning" the value function in advance and representing it in a way that allows for
real-time choice of actions based on it. The function is approximated both from above and
from below by piecewise linear and convex functions. The domains of linearity of these
functions are not stored explicitly, since such a representation would prohibit solutions in
high dimension. Yet, a linear programming formulation allows optimizing and updating
these functions in real time.

The Underlying Process 

The specific process considered herein is described as follows. States are
described by real $m$-vectors, and actions are described by real $n$-vectors. At a state $s$,
the system first undergoes a transition to a state $s + b$, where $b$ is a random vector from a
certain probability distribution over $\mathbb{R}^m$. Furthermore, $b$ is (stochastically) independent
of $s$. The action $x \in \mathbb{R}^n$ and the state $s' \in \mathbb{R}^m$ of the system after $x$ is taken must satisfy
the following constraints:

$$s' = s + b - Ax$$
$$Ms' \ge a \quad \text{and} \quad Dx \ge d$$

where $A \in \mathbb{R}^{m \times n}$, $b \in \mathbb{R}^m$, and $M$, $D$, $a$ and $d$ are arrays of consistent dimensions. Thus,
$s'$ is determined by $x$ and $s$. Taking an action $x$ costs $c^T x$ (where the superscript $T$
stands for matrix transposition). The cost vector $c$ may itself be drawn from a
probability distribution.
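
The transition and feasibility conditions above can be checked numerically. The following is a minimal sketch in plain Python, not part of the original disclosure; the helper names (`matvec`, `next_state`, `is_feasible`) and the two-resource data are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence

Vector = List[float]
Matrix = List[List[float]]


def matvec(mat: Matrix, vec: Sequence[float]) -> Vector:
    """Return the matrix-vector product using plain lists."""
    return [sum(a * v for a, v in zip(row, vec)) for row in mat]


def next_state(s: Vector, b: Vector, A: Matrix, x: Vector) -> Vector:
    """Apply the transition s' = s + b - A x."""
    Ax = matvec(A, x)
    return [si + bi - axi for si, bi, axi in zip(s, b, Ax)]


def is_feasible(s_next: Vector, x: Vector, M: Matrix, a: Vector,
                D: Matrix, d: Vector, tol: float = 1e-9) -> bool:
    """Check M s' >= a and D x >= d componentwise."""
    state_ok = all(lhs >= rhs - tol for lhs, rhs in zip(matvec(M, s_next), a))
    action_ok = all(lhs >= rhs - tol for lhs, rhs in zip(matvec(D, x), d))
    return state_ok and action_ok


# Hypothetical data: two resources, two products, nonnegative leftovers and actions.
A = [[1.0, 2.0], [3.0, 1.0]]
M = [[1.0, 0.0], [0.0, 1.0]]
a = [0.0, 0.0]
D = [[1.0, 0.0], [0.0, 1.0]]
d = [0.0, 0.0]
s, b = [4.0, 3.0], [2.0, 3.0]

s_ok = next_state(s, b, A, [1.0, 2.0])    # leaves one unit of each resource
s_bad = next_state(s, b, A, [2.0, 2.0])   # would overdraw the second resource
```

Here `is_feasible(s_ok, [1.0, 2.0], M, a, D, d)` holds, while the second plan violates $Ms' \ge a$.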

Consider the following example. One application of the model is in terms of a
dynamic production problem. Suppose a manufacturer optimizes profit by minimizing
$c^T x$ subject to $Ax \le q$. This is the classical (one-stage) product mix problem. This
problem generalizes as follows. Suppose, after each production stage, the vector of
available resources $q$ is updated. More precisely, suppose, at the end of a production
stage, the leftover amounts of resources are given in a vector $s$. Next, additional
amounts, given in a vector $b$, become available, so the resources available for the next
stage are given in the vector $s + b$. If the new production plan is given by $x$, then the
leftover amounts after the next stage are given by $s'$, where $Ax + s' = s + b$. In addition
to that, $x$ and $s'$ may have to satisfy some linear constraints. A myopic approach to the
problem would choose $x$ so as to minimize $c^T x$ subject to $Ax + s' = s + b$, $Ms' \ge a$ and
$Dx \ge d$, ignoring the value of leftover resources for the future. If the value function $L(\cdot)$
is known, then an optimal policy would minimize in each stage the objective function
$c^T x + \lambda L(s')$ subject to the above constraints.
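
The myopic problem described above is an ordinary linear program. The sketch below solves it with SciPy's `linprog` (HiGHS backend), which is an assumed solver choice and not one named in this specification; the function name `myopic_action` and the numerical data are hypothetical.

```python
import numpy as np
from scipy.optimize import linprog


def myopic_action(c, A, M, a, D, d, s, b):
    """Minimize c^T x subject to s' = s + b - A x, M s' >= a and D x >= d."""
    c, A, M, a, D, d = (np.asarray(v, dtype=float) for v in (c, A, M, a, D, d))
    resources = np.asarray(s, dtype=float) + np.asarray(b, dtype=float)
    # Eliminate s' = resources - A x and rewrite both constraints as "<=":
    #   M (resources - A x) >= a   <=>   (M A) x <= M resources - a
    #   D x >= d                   <=>   -D x <= -d
    A_ub = np.vstack([M @ A, -D])
    b_ub = np.concatenate([M @ resources - a, -d])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * c.size, method="highs")
    if not res.success:
        raise ValueError(res.message)
    return res.x, resources - A @ res.x, res.fun


# Hypothetical data: two resources, two products with unit profits 3 and 2.
c = [-3.0, -2.0]
A = [[1.0, 2.0], [3.0, 1.0]]
M, a = np.eye(2), np.zeros(2)
D, d = np.eye(2), np.zeros(2)
x, s_next, cost = myopic_action(c, A, M, a, D, d, s=[4.0, 3.0], b=[2.0, 3.0])
```

In this instance the myopic plan uses up both resources completely, which illustrates why it ignores any future value of leftovers.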

For any state $s$, let $F(s)$ denote the set of all the pairs $(x, s')$ such that $Ax + s' = s$,
$Ms' \ge a$ and $Dx \ge d$. The Bellman [1] optimality equation characterizes the value
function:

$$L(s) = E_{bc}\left[\min\{c^T x + \lambda L(s') \mid (x, s') \in F(s + b)\}\right]$$

where $E_{bc}$ denotes the expectation operator with respect to the distribution of the change
vector $b$ and the cost vector $c$.

Convexity of the Value Function 
In this section, it will be proved that the function $L(\cdot)$ is convex.

Let any sequence $b^0, b^1, b^2, \ldots$ of realizations of the state change vectors
(corresponding to the time stages) be fixed, let any sequence $c^0, c^1, c^2, \ldots$ of realizations
of the cost vectors (corresponding to the time stages) be fixed, and let an optimal policy
be fixed. Given an initial state $s^0$ and assuming the first change is by a random transition,
the future (i.e., the states and actions) is now completely determined and the total
discounted cost is equal to:

$$V(s^0; b, c) = \sum_{i=0}^{\infty} \lambda^i (c^i)^T x^i$$

where $x^0, x^1, x^2, \ldots$ are the action vectors, necessarily satisfying $Dx^i \ge d$. Furthermore, a
sequence of state vectors $s^0, s^1, s^2, \ldots$ is also determined, which satisfies
$Ax^i + s^{i+1} = s^i + b^i$ and $Ms^i \ge a$. Another initial state $t^0$ would generate different
sequences of action vectors $y^0, y^1, y^2, \ldots$ and state vectors $t^0, t^1, t^2, \ldots$ so that:

$$V(t^0; b, c) = \sum_{i=0}^{\infty} \lambda^i (c^i)^T y^i$$

and $Ay^i + t^{i+1} = t^i + b^i$, $Mt^i \ge a$ and $Dy^i \ge d$. Given any $0 < \alpha < 1$, consider the initial state
$(1-\alpha)s^0 + \alpha t^0$. The sequence of action vectors $z^i = (1-\alpha)x^i + \alpha y^i$ and state vectors
$u^i = (1-\alpha)s^i + \alpha t^i$ also satisfy the constraints: $Az^i + u^{i+1} = u^i + b^i$, $Mu^i \ge a$ and $Dz^i \ge d$,
and the resulting total discounted cost is equal to:

$$\sum_{i=0}^{\infty} \lambda^i (c^i)^T z^i = (1-\alpha)V(s^0; b, c) + \alpha V(t^0; b, c)$$

By taking expectations over $b$ and $c$ on both sides of the latter,

$$E\left[\sum_{i=0}^{\infty} \lambda^i (c^i)^T z^i\right] \le (1-\alpha)L(s^0) + \alpha L(t^0)$$

is obtained. The value on the left-hand side
corresponds to a policy that may not be stationary, but since there exists an optimal
stationary policy, it follows that its value cannot be smaller than $L(u^0)$. Thus,
$L(u^0) \le (1-\alpha)L(s^0) + \alpha L(t^0)$. This completes the proof of convexity.

In view of the convexity of $L(s)$, it can be efficiently approximated both from
above and from below. Both approximations can be repeatedly refined both during
pre-processing and online.



Approximating the Value Function from Above 
The approximation from above is based on knowledge of upper bounds on $L(s)$ at
each member of some manageable set of states. Suppose it has been concluded that for
certain state vectors $u^1, \ldots, u^k$, $L(u^i) \le f_i$ ($i = 1, \ldots, k$). Then, convexity implies that for any
nonnegative $y = (y_1, \ldots, y_k)^T$ such that $\sum_i y_i = 1$, necessarily, $L(\sum_i y_i u^i) \le \sum_i y_i f_i$.
Thus, for any given state $s$, the least upper bound on $L(s)$ that can be derived from
$f = (f_1, \ldots, f_k)^T$ by convexity, which is denoted by $\bar{L}_f(s)$, can be obtained by solving a
linear programming problem as follows.

Denote $U = [u^1, \ldots, u^k] \in \mathbb{R}^{m \times k}$ and $e = (1, \ldots, 1)^T \in \mathbb{R}^k$. Then, solve:

$$\begin{array}{lll}
\text{Minimize}_{y} & f^T y & \\
\text{subject to} & Uy = s & \text{(P1)} \\
 & e^T y = 1 & \\
 & y \ge 0 &
\end{array}$$
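
As an illustration of (P1), the following sketch computes the least upper bound implied by convexity from a small set of known bounds. It assumes SciPy's `linprog` as the LP solver, and the function name `upper_bound_p1` and the one-dimensional data (bounds taken from the convex function $s^2$ as a stand-in for $L$) are hypothetical.

```python
import numpy as np
from scipy.optimize import linprog


def upper_bound_p1(U, f, s):
    """Solve (P1): minimize f^T y subject to U y = s, e^T y = 1, y >= 0.

    Returns (bound, y), or (None, None) when s is not a convex combination
    of the columns of U, in which case (P1) is infeasible.
    """
    U = np.asarray(U, dtype=float)
    f = np.asarray(f, dtype=float)
    s = np.asarray(s, dtype=float)
    k = U.shape[1]
    A_eq = np.vstack([U, np.ones((1, k))])      # rows: U y = s, then e^T y = 1
    b_eq = np.concatenate([s, [1.0]])
    res = linprog(f, A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * k, method="highs")
    if not res.success:
        return None, None
    return res.fun, res.x


# Hypothetical one-dimensional data: states 0, 1, 2, 4 with bounds f_i = u_i**2.
U = [[0.0, 1.0, 2.0, 4.0]]
f = [0.0, 1.0, 4.0, 16.0]
bound_at_3, weights = upper_bound_p1(U, f, [3.0])
bound_outside, _ = upper_bound_p1(U, f, [5.0])
```

For the state 3 the solver mixes the neighboring states 2 and 4 with equal weights, which gives the bound 10; the state 5 lies outside the convex hull of the stored states, so no bound is available for it.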
A nice feature of this approximation is that there is no need to derive such least
upper bounds ahead of time. The problem (P1) can be embedded in a larger linear
programming problem that determines an approximately optimal action. Specifically, if
the upper bounds are sufficiently close to $L$, then an approximately optimal action at a
state $s + b$ can be calculated by solving the following:

$$\begin{array}{lll}
\text{Minimize}_{x, y} & c^T x + \lambda f^T y & \\
\text{subject to} & Ax + Uy = s + b & \\
 & MUy \ge a & \text{(P2)} \\
 & Dx \ge d & \\
 & e^T y = 1 & \\
 & y \ge 0 &
\end{array}$$

The problem (P2) can be solved for any choice of the vectors $b$ and $c$. If the latter are
sampled sufficiently many times from the given probability distribution, then an updated
approximate value function can be obtained for $L(s)$ as follows. Denote by $H(s; b, c)$ the
optimal value of (P2), and denote by $J(s; b, c)$ the minimum of $c^T x + \lambda L(s')$ subject to
$(x, s') \in F(s + b)$. Obviously, $J(s; b, c) \le H(s; b, c)$, so $L(s) = E_{bc}[J(s; b, c)] \le E_{bc}[H(s; b, c)]$.

If $\{(b^i, c^i)\}$ is an adequate random sample, consider the values $H(s; b^i, c^i)$, and based
on them calculate a high-confidence upper bound $\eta$ on $E_{bc}[J(s; b, c)]$.

If $\eta < \bar{L}_f(s)$, then $s$ can be added to the set of states, by setting $u^{k+1} = s$ and $f_{k+1} = \eta$. When
desired, some states $u^i$ may be dropped when their corresponding values are implied by the
values of other states, as evident by solving a problem like (P1).
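
A minimal sketch of (P2) follows, again assuming SciPy's `linprog` as the solver. The function name `solve_p2` and the single-resource data are hypothetical: one unit sold now earns 1, and the stored bounds value an inventory of 10 at -15.

```python
import numpy as np
from scipy.optimize import linprog


def solve_p2(c, f, A, U, M, a, D, d, s_plus_b, lam):
    """Solve (P2) over (x, y); return (optimal value H, action x, next state U y)."""
    c, f, A, U, M, a, D, d = (np.asarray(v, dtype=float)
                              for v in (c, f, A, U, M, a, D, d))
    n, k = A.shape[1], U.shape[1]
    cost = np.concatenate([c, lam * f])                  # c^T x + lambda f^T y
    A_eq = np.vstack([np.hstack([A, U]),                 # A x + U y = s + b
                      np.hstack([np.zeros((1, n)), np.ones((1, k))])])  # e^T y = 1
    b_eq = np.concatenate([np.asarray(s_plus_b, dtype=float), [1.0]])
    A_ub = np.vstack([np.hstack([np.zeros((M.shape[0], n)), -(M @ U)]),  # M U y >= a
                      np.hstack([-D, np.zeros((D.shape[0], k))])])       # D x >= d
    b_ub = np.concatenate([-a, -d])
    bounds = [(None, None)] * n + [(0, None)] * k        # x free, y >= 0
    res = linprog(cost, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=bounds, method="highs")
    if not res.success:
        raise ValueError(res.message)
    x, y = res.x[:n], res.x[n:]
    return res.fun, x, U @ y


# Hypothetical single-resource instance with discount factor 0.9.
value, x, s_next = solve_p2(c=[-1.0], f=[0.0, -15.0], A=[[1.0]], U=[[0.0, 10.0]],
                            M=[[1.0]], a=[0.0], D=[[1.0]], d=[0.0],
                            s_plus_b=[6.0], lam=0.9)
```

With these numbers, holding a unit is worth more (0.9 times 1.5) than selling it (1), so the solver keeps all six units. Averaging `value` over sampled pairs $(b, c)$ would give a sample estimate of $E_{bc}[H(s; b, c)]$.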
Convergence

Now, it can be proven that if the approximation from above cannot be improved
by deriving more bounds from solutions of (P2), then the current approximation is exact.
Thus, suppose for every state $s$:

$$\bar{L}(s) \equiv E_{bc}[H(s; b, c)] \ge \bar{L}_f(s)$$

It follows that for the states $u^i$ in the current formulation:

[equation not legible in the source text]

and for any state $s'$, if in an optimal solution of (P2), $s' = Uy$, $e^T y = 1$, and $y \ge 0$, then
also:

[equation not legible in the source text]

It follows that the function $\bar{L}$ satisfies the Bellman equation:

$$\bar{L}(s) = E_{bc}\left[\min\{c^T x + \lambda \bar{L}(s')\}\right]$$

Let $\bar{x}(s; b, c)$ and $\bar{s}'(s; b, c)$ denote optimal choices at $s + b$, given $c$, relative to the
function $\bar{L}$, and let $x(s; b, c)$ and $s'(s; b, c)$ denote the optimal choices at $s + b$ relative to
the function $L$. Thus:

$$\bar{L}(s) = E_{bc}[c^T \bar{x}(s; b, c) + \lambda \bar{L}(\bar{s}'(s; b, c))] \le E_{bc}[c^T x(s; b, c) + \lambda \bar{L}(s'(s; b, c))]$$

and:

$$L(s) = E_{bc}[c^T x(s; b, c) + \lambda L(s'(s; b, c))]$$

It follows that:

$$\begin{aligned}
0 \le \bar{L}(s) - L(s) &\le E_{bc}[c^T x(s; b, c) + \lambda \bar{L}(s'(s; b, c))] - E_{bc}[c^T x(s; b, c) + \lambda L(s'(s; b, c))] \\
&= \lambda \left( E_{bc}[\bar{L}(s'(s; b, c))] - E_{bc}[L(s'(s; b, c))] \right)
\end{aligned}$$

Assuming the state space is bounded, the latter implies $\bar{L}(s) = L(s)$, since otherwise a
contradiction is obtained by choosing $s$ so that $\bar{L}(s) - L(s)$ is sufficiently close to the
supremum of this difference over the state space.



Approximating the Value Function from Below 
The approximation from below is based on knowledge of linear functions that lie
below the convex function $L(s)$. Suppose $v^1, \ldots, v^r \in \mathbb{R}^m$ and $g_1, \ldots, g_r \in \mathbb{R}$ are such that
for every $s$ that satisfies $Ms \ge a$:

$$L(s) \ge l_i(s) \equiv (v^i)^T s + g_i \quad (i = 1, \ldots, r)$$

Then, the maximum $\underline{L}(s) = \max_i \{l_i(s)\}$ is a convex function that bounds $L(s)$ from
below.
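
Evaluating the approximation from below at a given state only requires a maximum over the stored linear functions. A minimal sketch in plain Python follows; the function name `lower_bound` and the example cuts (tangent lines to the convex function $s^2$, used here as a stand-in for $L$) are hypothetical.

```python
from typing import List, Sequence, Tuple

Cut = Tuple[Sequence[float], float]   # (v, g) represents l(s) = v^T s + g


def lower_bound(cuts: List[Cut], s: Sequence[float]) -> float:
    """Return max_i (v^i)^T s + g_i, the piecewise linear bound from below."""
    return max(sum(vj * sj for vj, sj in zip(v, s)) + g for v, g in cuts)


# Hypothetical cuts: tangents to s**2 at s = 0, 1 and 2, i.e. l(s) = 2 t s - t**2.
cuts = [([0.0], 0.0), ([2.0], -1.0), ([4.0], -4.0)]
at_1_5 = lower_bound(cuts, [1.5])   # both tangents at 1 and 2 give 2.0 (true value 2.25)
at_3 = lower_bound(cuts, [3.0])     # tangent at 2 gives 8.0 (true value 9.0)
```

Because only the pairs $(v^i, g_i)$ are stored, the regions on which each linear piece is active never need to be enumerated.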



An Alternate Value Function 

Because of the alternating nature of the process, i.e., alternating between chosen
actions and random transitions, there is an alternative way to define values of states,
which turns out to be helpful in the case of approximation from below. First, denote:

$$K(s; c) = \min\{c^T x + \lambda L(s') \mid (x, s') \in F(s)\}$$

so that:

$$L(s) = E_{bc}[K(s + b; c)]$$

It is easy to see that $K(s; c)$ is the minimum possible expected total discounted cost, given
$c$, when the starting state is $s$ and the first change is induced by a chosen action rather
than by a random transition. The Bellman [1] optimality equation, therefore, can be
written in terms of $K$ in the form:

$$K(s; c) = \min\{c^T x + \lambda E_{bc'}[K(s' + b; c')] \mid (x, s') \in F(s)\}$$

Convexity 

The proof of convexity of the function $K(s; c)$ (for a fixed $c$) is similar to that of
$L(s)$. Fix any sequence $b^0, b^1, b^2, \ldots$ of realizations of the state change vectors (except that
$b^0 = 0$), fix any sequence $c^0, c^1, c^2, \ldots$ of cost vectors, and fix an optimal policy. An initial
state $s^0$ now determines the future states and actions, and therefore:

$$V(s^0; b, c) = \sum_{i=0}^{\infty} \lambda^i (c^i)^T x^i$$

where the action vectors satisfy $Ax^i + s^{i+1} = s^i + b^i$, $Ms^i \ge a$ and $Dx^i \ge d$. Another initial
state $t^0$ has:

$$V(t^0; b, c) = \sum_{i=0}^{\infty} \lambda^i (c^i)^T y^i$$

with $Ay^i + t^{i+1} = t^i + b^i$, $Mt^i \ge a$, and $Dy^i \ge d$. It can be deduced that:

$$V((1 - \alpha)s^0 + \alpha t^0; b, c) \le (1 - \alpha)V(s^0; b, c) + \alpha V(t^0; b, c)$$

for every $0 < \alpha < 1$, and convexity is established by taking expectations over the
sequences of $b^i$'s and $c^i$'s.



Bounds From Duality 

Now, the lower bounds on $K(s; c)$ can be derived based on the linear functions
$l_i(s)$. Denote $V^T = [v^1, \ldots, v^r] \in \mathbb{R}^{m \times r}$, $g = (g_1, \ldots, g_r)^T$ and $e = (1, \ldots, 1)^T \in \mathbb{R}^r$.
Let $\xi$ be a scalar, and denote the optimal value of the following linear programming
problem by $\underline{K}(s; c)$:

$$\begin{array}{lll}
\text{Minimize}_{x, s', \xi} & c^T x + \lambda \xi & \\
\text{subject to} & Ax + s' = s & \text{(P3)} \\
 & Ms' \ge a & \\
 & Dx \ge d & \\
 & \xi e - Vs' \ge g &
\end{array}$$

Obviously, at an optimal solution, $\xi$ has the value of
$\underline{L}(s') = \max_i \{(v^i)^T s' + g_i\} \le L(s')$. It follows that $\underline{K}(s; c) \le K(s; c)$, but further
information can be derived by considering the dual problem of (P3), i.e.:

$$\begin{array}{lll}
\text{Maximize}_{y, z, w, p} & s^T y + a^T z + d^T w + g^T p & \\
\text{subject to} & A^T y + D^T w = c & \\
 & y + M^T z - V^T p = 0 & \text{(D3)} \\
 & e^T p = \lambda & \\
 & z, w, p \ge 0 &
\end{array}$$

The function $\underline{K}(s; c)$ is piecewise linear and convex since the feasible domain of (D3) is
independent of $s$, and the maximum is always attained at a feasible basis; so, $\underline{K}(s; c)$ is
the maximum of linear functions of $s$ corresponding to feasible bases of (D3). Let
$\bar{y} = \bar{y}(\bar{s}; c)$, $\bar{z} = \bar{z}(\bar{s}; c)$, $\bar{w} = \bar{w}(\bar{s}; c)$ and $\bar{p} = \bar{p}(\bar{s}; c)$ denote the optimal solution of (D3)
when $s = \bar{s}$. Since this solution is feasible for every $s$, it follows that for every $s$,
because $\underline{K}(s; c)$ is also the maximum value of (D3), the following holds:

$$K(s; c) \ge \underline{K}(s; c) \ge \bar{y}^T s + a^T \bar{z} + d^T \bar{w} + g^T \bar{p}$$

Replacing $s$ by $s + b$, the latter can also be expressed by stating that for every $s$ and $b$:

$$K(s + b; c) \ge \bar{y}^T s + \bar{y}^T b + a^T \bar{z} + d^T \bar{w} + g^T \bar{p}$$

Now, let $v^{r+1} = E_c[\bar{y}]$ and let:

$$g_{r+1} = E_{bc}[\bar{y}^T b] + a^T \bar{z} + d^T \bar{w} + g^T \bar{p}$$

so that:

$$E_{bc}[K(s + b; c)] \ge (v^{r+1})^T s + g_{r+1}$$

Since $L(s) = E_{bc}[K(s + b; c)]$, it follows that $L(s) \ge l_{r+1}(s) \equiv (v^{r+1})^T s + g_{r+1}$.

If $b$ and $c$ are stochastically independent, then the expectation $E_{bc}[\bar{y}^T b]$ can be
easily calculated as $\sum_i E_c[\bar{y}_i] E[b_i]$, and each $E[b_i]$ can be estimated in advance.

If $\underline{L}(s)$ is sufficiently close to $L(s)$, then an approximately optimal action at a state
$s + b$ can also be calculated by solving (P3).
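
The following sketch solves (P3) for one fixed cost vector and reads off a new linear lower bound on the optimal value of (P3) from the equality-constraint duals. It assumes SciPy 1.7 or later, where `linprog` with the HiGHS backend reports `eqlin.marginals` (the sensitivity of the optimal value to the right-hand side $s$). The function name `solve_p3` and the single-resource data are hypothetical, and the averaging over $b$ and $c$ that produces $v^{r+1}$ and $g_{r+1}$ is left out.

```python
import numpy as np
from scipy.optimize import linprog


def solve_p3(c, A, M, a, D, d, V, g, s, lam):
    """Solve (P3) over (x, s', xi); return (value, x, s', dual of A x + s' = s)."""
    c, A, M, a, D, d, V, g, s = (np.asarray(v, dtype=float)
                                 for v in (c, A, M, a, D, d, V, g, s))
    m, n = A.shape
    r = V.shape[0]                                        # rows of V are the v^i
    cost = np.concatenate([c, np.zeros(m), [lam]])        # c^T x + lambda xi
    A_eq = np.hstack([A, np.eye(m), np.zeros((m, 1))])    # A x + s' = s
    A_ub = np.vstack([
        np.hstack([np.zeros((M.shape[0], n)), -M, np.zeros((M.shape[0], 1))]),  # M s' >= a
        np.hstack([-D, np.zeros((D.shape[0], m + 1))]),                         # D x >= d
        np.hstack([np.zeros((r, n)), V, -np.ones((r, 1))]),                     # xi e - V s' >= g
    ])
    b_ub = np.concatenate([-a, -d, -g])
    res = linprog(cost, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=s,
                  bounds=[(None, None)] * (n + m + 1), method="highs")
    if not res.success:
        raise ValueError(res.message)
    return res.fun, res.x[:n], res.x[n:n + m], res.eqlin.marginals


# Hypothetical single-resource instance with two existing cuts on L.
data = dict(c=[-1.0], A=[[1.0]], M=[[1.0]], a=[0.0], D=[[1.0]], d=[0.0],
            V=[[-1.5], [-0.5]], g=[0.0, -5.0], lam=0.9)
s_bar = np.array([6.0])
value_bar, x_bar, s_next_bar, y_bar = solve_p3(s=s_bar, **data)

# By strong duality, value_bar - y_bar^T s_bar equals the constant term of the cut,
# so the cut reads: value(s) >= y_bar^T s + intercept for every s.
intercept = value_bar - y_bar @ s_bar
value_other = solve_p3(s=[2.0], **data)[0]
cut_at_other = y_bar @ np.array([2.0]) + intercept
```

Computing the intercept from the optimal value avoids relying on the sign conventions of the inequality duals; only the dual of the equality constraint is read from the solver.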

Convergence

Denote:

$$\underline{L}_r(s) = \max_{1 \le i \le r} \{l_i(s)\}$$

and suppose the $(r + 1)$th linear function is obtained while $\bar{s}$ is the next state that is
calculated while solving (P3), i.e.:

[equation not legible in the source text]

and $l_{r+1}(\bar{s}; \bar{s})$ is the optimal value of (P3). Then:

$$\underline{L}_{r+1}(s) = \max\{\underline{L}_r(s),\ l_{r+1}(s)\}$$

Obviously, as long as there exist $s$ and $\bar{s}$ such that $l_{r+1}(s; \bar{s}) > \underline{L}_r(s)$, then a better
approximate value function $\underline{L}_{r+1}(s)$ can be obtained. Otherwise, further iterations from
below would result in the same approximate value function $\underline{L}_r(s)$.

Denote $\tilde{L}(s) = E_{bc}[\underline{K}(s + b; c)]$ and suppose that (P3) cannot improve the
approximation from below at any state, that is, for every $s$, $\tilde{L}(s) \le \underline{L}_r(s)$. It can now be
shown that in this case $\underline{L}_r(s)$ is indeed the correct value $L(s)$. By definition, $\underline{K}(s; c)$ is
obtained by solving (P3), which also determines a certain policy $x = x(s; c)$ and a certain
next state $s' = s'(s; c)$. Denote by $\pi(s; c)$ the expected total discounted cost, starting at $s$
with cost vector $c$ (just before an action has to be chosen) and using the policy $x(\cdot)$. Then,
for all $s$:

$$\pi(s; c) = c^T x(s; c) + \lambda E_{bc'}[\pi(s'(s; c) + b; c')]$$

It follows that for all $s$:

$$\underline{L}_r(s) - E_c[\pi(s; c)] \ge \lambda \cdot E_{bc}[\underline{L}_r(s'(s; c) + b) - \pi(s'(s; c) + b; c)]$$

On the other hand, since $L(s)$ is the optimal value, for all $s$:

$$E_c[\pi(s; c)] \ge L(s) \ge \underline{L}_r(s)$$

It follows that:

$$0 \ge \underline{L}_r(s^0) - E_c[\pi(s^0; c)] \ge \lambda^i E[\underline{L}_r(S^i) - \pi(S^i; c)]$$

where $S^i$ denotes the state that is reached after $i$ steps, while implementing the policy
$x(\cdot)$, so the expectation is taken over all the random parameters involved. If the state
space is bounded, the sequence $E[\underline{L}_r(S^i) - \pi(S^i; c)]$ is also bounded and, if $\lambda < 1$, it
follows that $\underline{L}_r(s^0) = E_c[\pi(s^0; c)] = L(s^0)$.

Learning The Value Function 

The value function can be approximated successively. Iterations of the successive
approximation algorithm can be executed both during the preprocessing phase, while
simulating random state transitions, and also while the MDP itself is running in real time.
In such simulations, actions are chosen based on the currently known approximations
from above and from below. It is important to note that, for the sake of running the MDP
optimally, there is no need to know the value function in its entirety. It suffices to know
only the values of states that actually occur in the process and those that could occur if
certain actions were taken, and even so, only the relative values are important for
choosing the best action. In other words, sometimes values do not have to be known with
high accuracy. Furthermore, knowledge of a lower bound on the value of one state and
an upper bound on the value of another state may suffice for choosing the optimal action.
Suppose the system is currently in state $s$ just before an action has to be chosen.
An action can be chosen based on the approximation of $L(s')$ from above by solving (P2)
or based on the approximation of $L(s')$ from below by solving (P3). Moreover, a
weighted average of the two approximate value functions can be used by solving the
following linear programming problem for some constant $\beta$, and the sensitivity of the
solution to $\beta$ can even be analyzed:

$$\begin{array}{lll}
\text{Minimize} & c^T x + \beta \lambda f^T y + (1 - \beta) \lambda \xi & \\
\text{subject to} & Ax + Uy = s & \\
 & \xi e - VUy \ge g & \\
 & MUy \ge a & \text{(P4)} \\
 & Dx \ge d & \\
 & e^T y = 1 & \\
 & y \ge 0 &
\end{array}$$



Suppose an action $x$ is calculated as an optimal solution of one of the possible linear
programming problems. Before an action $x$ is actually executed, the resulting state $s'$ can
be further evaluated by running simulations with $s'$ as the initial state and using any of
the approximate value functions for computing actions. An approximate value function
determines some policy, not necessarily optimal, so the simulation results provide an
upper bound on the value of $s'$. Such a bound can be compared to the one derived by
convexity from previously known states, and the state $s'$ may be added to the list of
states that are used for representing the approximation from above.

Logic of the Preferred Embodiment 

FIG. 2 is a flowchart that illustrates the general logic of a message or event-driven
supply chain planning process 112 according to the preferred embodiment of the present
invention. Specifically, the logic indicates how the supply chain planning process 112
derives optimal policies during its operation.

Block 200 represents the supply chain planning process 112 accessing a vector $s$
of available resources, e.g., from a database 116, from another system, from a previous
cycle of this process 112, etc. Note that supply chain management decisions are made
cyclically, and thus Block 200 may represent the process 112 obtaining a vector of
leftover resources from a previous cycle.

Block 202 represents the supply chain planning process 112 accessing a resource
change vector $b$ and a cost vector $c$, e.g., from a database 116, from another system, from
a previous iteration of this process 112, etc. The vector $b$ comprises additional resources
that became available since the last cycle, and this vector $b$ is added to the vector $s$.

Block 204 represents the supply chain planning process 112 employing a linear
programming (LP) solver to compute a new action vector $x$. Specifically, the LP solver
uses an approximate value function in a linear programming formulation to
determine an action $x$, e.g., how much to produce from the available resources $s$. The
problem is to determine the action $x$, not just to maximize the immediate profits by
minimizing $c^T x$, but to take into account the value of the vector $s'$ of available resources
remaining in anticipation of arrival of more resources.

As noted above, both the state space (i.e., the vector $s$) and the action space (i.e.,
$X(s)$) are continuous. Moreover, the state space and the action space are related to each
other through a system of linear constraints, wherein one or more actions in the action
space that are feasible in a state of the state space are linearly constrained with respect to
the state.

Specifically, this Block builds one or more approximations from above and from 
below to a value function L(s) for the state s using representations that facilitate the 
computation of approximately optimal actions at any given state by linear programming. 
These approximations can be repeatedly refined in an iterative manner. The value 
function L(s) is convex, which means that it can be efficiently learned in advance and can 
be represented in a way that allows for real-time choice of actions based on it. Once the 
value function L(s) is approximated, an action x can be selected. 
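
The two kinds of approximation can be illustrated in one dimension. The following sketch is an example under stated assumptions, not the procedure of the invention: it uses a stand-in convex function L(s) = (s - 3)^2, builds the approximation from below as the maximum of tangent lines, and builds the approximation from above from values of L at a few sample states. The function names and sample states are hypothetical. In high dimensions the upper approximation would typically be evaluated by linear programming rather than by the pairwise search used here.

```python
def true_value(s):
    """Stand-in convex value function, used only to generate bounds."""
    return (s - 3.0) ** 2


def tangent_cut(p):
    """Affine piece (slope, intercept) touching true_value at p, hence below it."""
    slope = 2.0 * (p - 3.0)
    return slope, true_value(p) - slope * p


def lower_bound(s, cuts):
    """Approximation from below: maximum of affine functions lying under L."""
    return max(a * s + b for a, b in cuts)


def upper_bound(s, points):
    """Approximation from above from upper bounds u_i known at states s_i.

    For convex L, writing s as a convex combination of sample states gives
    L(s) <= sum(lambda_i * u_i). In one dimension the tightest such bound
    uses at most two sample states. Returns infinity outside the samples.
    """
    best = float("inf")
    for s1, u1 in points:
        if s1 == s:
            best = min(best, u1)
        for s2, u2 in points:
            if s1 < s < s2:
                lam = (s2 - s) / (s2 - s1)
                best = min(best, lam * u1 + (1.0 - lam) * u2)
    return best


samples = [0.0, 2.0, 4.0, 6.0]
cuts = [tangent_cut(p) for p in samples]
points = [(p, true_value(p)) for p in samples]
gap_before = upper_bound(3.0, points) - lower_bound(3.0, cuts)

# Refinement step: one more sample state tightens both approximations.
cuts.append(tangent_cut(3.0))
points.append((3.0, true_value(3.0)))
gap_after = upper_bound(3.0, points) - lower_bound(3.0, cuts)
```

In this example the gap between the two approximations at s = 3 shrinks once the additional sample state is included, which is the sense in which the approximations can be refined iteratively.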

Block 206 represents the supply chain planning process 112 executing the actions 
described by the vector x, wherein the vector s' represents the leftover resources 
remaining after the actions described by the vector x have been performed. 

Block 208 represents the supply chain planning process 112 setting the vector s to 
the vector s' for the next cycle. Thereafter, control transfers back to Block 200. 
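
The cycle formed by Blocks 200 through 208 can be summarized in a few lines. The sketch below is a simplification under stated assumptions: resources and actions are scalars rather than vectors, executing an action x is assumed to consume usage * x units of resource, and choose_action stands in for the LP-based selection of Block 204. The names run_cycles, usage, and keep_reserve are hypothetical.

```python
def run_cycles(s, arrivals, choose_action, usage):
    """Simulate the planning cycle for one resource and one action per cycle."""
    log = []
    for b in arrivals:              # b: resources that became available since the last cycle
        s = s + b                   # Block 202: add the resource change b to s
        x = choose_action(s)        # Block 204: select an action for the state s
        s_next = s - usage * x      # Block 206: leftover after executing x (assumed linear)
        log.append((s, x, s_next))
        s = s_next                  # Block 208: the leftover becomes the next state
    return s, log


def keep_reserve(s):
    """Placeholder policy: use everything above a reserve of 2 units."""
    return max(0.0, s - 2.0)


# Block 200: start from 1 unit of leftover resources.
final_s, log = run_cycles(1.0, [5.0, 0.0, 3.0], keep_reserve, usage=1.0)
```

Each entry of log records the state after arrivals, the action taken, and the leftover carried into the next cycle.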


Conclusion 

This concludes the description of the preferred embodiment of the invention. In 
summary, the present invention discloses a method, apparatus, and article of manufacture 
for solving stochastic control problems of linear systems in high dimensions by modeling 
a structured Markov Decision Process (MDP). A state space for the MDP is a 
polyhedron in a Euclidean space and one or more actions that are feasible in a state of the 
state space are linearly constrained with respect to the state. One or more approximations 
are built from above and from below to a value function for the state using 
representations that facilitate the computation of approximately optimal actions at any 
given state by linear programming. 

The foregoing description of the preferred embodiment of the invention has been 
presented for the purposes of illustration and description. It is not intended to be 
exhaustive or to limit the invention to the precise form disclosed. Many modifications 
and variations are possible in light of the above teaching. It is intended that the scope of 
the invention be limited not by this detailed description, but rather by the claims 
appended hereto. 

WHAT IS CLAIMED IS: 

1. A method for solving, in a computer, stochastic control problems of linear 
systems in high dimensions, comprising: 

(a) modeling, in the computer, a structured Markov Decision Process (MDP), 
wherein a state space for the MDP is a polyhedron in a Euclidean space and one or more 
actions that are feasible in a state of the state space are linearly constrained with respect 
to the state; and 

(b) building, in the computer, one or more approximations from above and from 
below to a value function for the state using representations that facilitate the 
computation of approximately optimal actions at any given state by linear programming. 

2. The method of claim 1, wherein the MDP comprises a supply chain 
planning process. 

3. The method of claim 1, wherein the action space and the state space are 
continuous and related to each other through a system of linear constraints. 



4. The method of claim 1, wherein the value function is convex and the 
method further comprises efficiently learning the value function in advance and 
representing the value function in a way that allows for real-time choice of actions based 
thereon. 

5. The method of claim 1, wherein the linear function is approximated both 
from above and from below by piecewise linear and convex functions. 



6. The method of claim 5, wherein the domains of linearity of the piecewise 
linear and convex functions are not stored explicitly, but rather are encoded through a 
linear programming formulation. 

7. The method of claim 5, wherein the domains of linearity of the piecewise 
linear and convex functions allow the functions to be optimized and updated in real-time. 


8. The method of claim 1, wherein the value function can be efficiently 
approximated both from above and from below. 



9. The method of claim 1, wherein the approximations can be repeatedly 
refined. 



10. The method of claim 1, wherein the value function can be efficiently 
approximated from above based on knowledge of upper bounds on the function at each 
member of a selected set of states. 

11. The method of claim 1, wherein the value function can be efficiently 
approximated from below based on linear functions that lie below the convex value 
function. 

12. The method of claim 1, wherein the value function can be approximated 
successively. 

13. A computerized apparatus for solving stochastic control problems of linear 
systems in high dimensions, comprising: 

(a) a computer; 

(b) logic, performed by the computer, for modeling a structured Markov Decision 
Process (MDP), wherein a state space for the MDP is a polyhedron in a Euclidean space 
and one or more actions that are feasible in a state of the state space are linearly 
constrained with respect to the state; and 

(c) logic, performed by the computer, for building one or more approximations 
from above and from below to a value function for the state using representations that 
facilitate the computation of approximately optimal actions at any given state by linear 
programming. 

14. The apparatus of claim 13, wherein the MDP comprises a supply chain 
planning process. 

15. The apparatus of claim 13, wherein the action space and the state space are 
continuous and related to each other through a system of linear constraints. 

16. The apparatus of claim 13, wherein the value function is convex and the 
logic further comprises efficiently learning the value function in advance and 
representing the value function in a way that allows for real-time choice of actions based 
thereon. 



17. The apparatus of claim 13, wherein the linear function is approximated 
both from above and from below by piecewise linear and convex functions. 

18. The apparatus of claim 17, wherein the domains of linearity of the 
piecewise linear and convex functions are not stored explicitly, but rather are encoded 
through a linear programming formulation. 

19. The apparatus of claim 17, wherein the domains of linearity of the 
piecewise linear and convex functions allow the functions to be optimized and updated in 
real-time. 

20. The apparatus of claim 13, wherein the value function can be efficiently 
approximated both from above and from below. 

21. The apparatus of claim 13, wherein the approximations can be repeatedly 
refined. 



22. The apparatus of claim 13, wherein the value function can be efficiently 
approximated from above based on knowledge of upper bounds on the function at each 
member of a selected set of states. 

23. The apparatus of claim 13, wherein the value function can be efficiently 
approximated from below based on linear functions that lie below the convex value 
function. 

24. The apparatus of claim 13, wherein the value function can be 
approximated successively. 



25. An article of manufacture embodying logic for solving stochastic control 
problems of linear systems in high dimensions, the logic comprising: 

(a) modeling a structured Markov Decision Process (MDP), wherein a state space 
for the MDP is a polyhedron in a Euclidean space and one or more actions that are 
feasible in a state of the state space are linearly constrained with respect to the state; and 

(b) building one or more approximations from above and from below to a value 
function for the state using representations that facilitate the computation of 
approximately optimal actions at any given state by linear programming. 










26. The article of manufacture of claim 25, wherein the MDP comprises a 
supply chain planning process. 

27. The article of manufacture of claim 25, wherein the action space and the 
state space are continuous and related to each other through a system of linear 
constraints. 

28. The article of manufacture of claim 25, wherein the value function is 
convex and the logic further comprises efficiently learning the value function in advance 
and representing the value function in a way that allows for real-time choice of actions 
based thereon. 

29. The article of manufacture of claim 25, wherein the linear function is 
approximated both from above and from below by piecewise linear and convex functions. 

30. The article of manufacture of claim 29, wherein the domains of linearity 
of the piecewise linear and convex functions are not stored explicitly, but rather are 
encoded through a linear programming formulation. 

31. The article of manufacture of claim 29, wherein the domains of linearity 
of the piecewise linear and convex functions allow the functions to be optimized and 
updated in real-time. 

32. The article of manufacture of claim 25, wherein the value function can be 
efficiently approximated both from above and from below. 

33. The article of manufacture of claim 25, wherein the approximations can be 
repeatedly refined. 


34. The article of manufacture of claim 25, wherein the value function can be 
efficiently approximated from above based on knowledge of upper bounds on the 
function at each member of a selected set of states. 

35. The article of manufacture of claim 25, wherein the value function can be 
efficiently approximated from below based on linear functions that lie below the convex 
value function. 

36. The article of manufacture of claim 25, wherein the value function can be 
approximated successively. 

ABSTRACT OF THE DISCLOSURE 



Stochastic control problems of linear systems in high dimensions are solved by 
modeling a structured Markov Decision Process (MDP). A state space for the MDP is a 
polyhedron in a Euclidean space and one or more actions that are feasible in a state of the 
state space are linearly constrained with respect to the state. One or more approximations 
are built from above and from below to a value function for the state using 
representations that facilitate the computation of approximately optimal actions at any 
given state by linear programming. 


